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Abstract 

Approximate Bayesian computation (ABC) is an approach for sampling from an 
approximate posterior distribution in the presence of a computationally intractable 
likelihood function. A common implementation is based on simulating model, param- 
eter and dataset triples, (m,9,y), from the prior, and then accepting as samples from 
the approximate posterior, those pairs (m, 6) for which y, or a summary of y, is "close" 
to the observed data. Closeness is typically determined though a distance measure and 
a kernel scale parameter, e. Appropriate choice of e is important to producing a good 
quality approximation. This paper proposes diagnostic tools for the choice of e based 
on assessing the coverage property, which asserts that credible intervals have the correct 
coverage levels. We provide theoretical results on coverage for both model and param- 
eter inference, and adapt these into diagnostics for the ABC context. We re-analyse 
a study on human demographic history to determine whether the adopted posterior 
approximation was appropriate. R code implementing the proposed methodology is 
freely available in the package abc. 
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1 Introduction 



For a given model, m, Bayesian inference for unknown parameters 6, and given observed data 
Hobs, updates prior beliefs 7r(8\m) through the likelihood function n(y o bs\0, w). This produces 
the posterior distribution ir(9\y ohs , m) = Tr(y bs\@, m )' n '(9\'m)/7r(y \ )S \m), where vr(?/ obs |m) = 
J 7r(y bs\9, m)'K(6\m)d9 is the integrated likelihood for model m. Similarly, Bayesian inference 
for a discrete set of models m e {1, 2, ... , M} updates a prior mass function p(m) to posterior 
weights p(^|y bs) 7r (y bs\ m )p{ m ) • Here the joint posterior of parameter and model is given 
by tt(6, m\y bs) oc ir(y h B \Q, m)it(6\m)p(m). Increased usage of Bayesian inference in recent 
decades has been built on powerful algorithms, such as Markov chain Monte Carlo, which 
make use of repeated evaluation of the likelihood function(s). 

Approximate Bayesian computation (ABC) refers to a family of algorithms which per- 
form an approximate Bayesian inference when numerical evaluation of the likelihood is not 
feasible, but where it is possible to simulate from the model(s) y ~ 7t(-|0,to). ABC has 
become a popular tool for the analysis of complicated models in a wide range of challenging 



applications. See e.g. Beaumont (2010); Bertorelle et al. (2010); Csillery et al. (2010); Marin 



et al. (2012) and Sisson and Fan (2011) for overviews of methods and applications. 



A common, importance sampling-based implementation of ABC, expressed for the multi- 
model setting, is given by the following: 



ABC importance sampling 

1. For i = 1, ... ,N: 

Sample a model and parameter from the prior (6i,mi) ~ n(9\m)p(m). 
Simulate data from model uii as yi ~ n(y\9i, mi). 

2. Weight each sample (y^Ou mi) by Wi oc K e (\\yi - y obs \\). 



Algorithm 1: An ABC importance sampling algorithm, based on a single large sample of size N, 
incorporating model choice and parameter inference. 

Here K e (u) = K(u/e)/e is a standard smoothing kernel with scale parameter e > 0, 
and || • || is a distance measure e.g. Euclidean. For this article, for simplicity and w.l.o.g. 
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we consider K e (u) to be the uniform kernel U(—e,e), so that step [2] above corresponds to 
selecting (i.e. with non-zero weights) those samples (yi,9i,rrii) for which \\yi — y h s \\ < £■ 
Note that when p(m) is small, the above algorithm may have large Monte Carlo error in 
estimating p(m\y b s ). This is often avoided by using a uniform prior mass function in place 
of p{m) as a computational device, and then reweighting each model appropriately (see e.g. 



Grelaud et al.| ( |2009[ )). 

The output of the above algorithm is a sample of parameter vectors (8i,rrii) from an 
approximation to the posterior 

n ABC (9, m\y ohs ) octt (9\m)p(m) J ir{y\9, m)K e (\\y - y oba \\) dy, (1) 

where it can be seen that ttabc(9, m\y i, s ) ps ir(9,m\y b s ) following standard conditional 
density estimation arguments. There are two sources of approximation error in the sample 
representation of Q from Algorithm [TJ both of which are influenced by e. The first is 
the discrepancy between 7iABc(9,m\y h s ) and tc(9, m\y b s ). These are equal in the limit 
e — > 0, however the approximation deteriorates as e is increased. The extreme result is 
lim^oo 71abc(9, m\y b s ) = Tr(9\m)p(m). That is, all proposed samples (9,m) are accepted, so 
that ABC targets the prior distribution and all information from the data is lost. Secondly, 
ttabc^ m\y bs) is approximated by a finite sample whose size reduces as e — )• (in other 
ABC algorithms this corresponds to extremely low acceptance rates). Indeed the sample 
size typically reduces to zero for continuous data. In effect, e controls a form of the usual 



bias- variance trade-off (Blum, 2010a). 



Many approaches have been proposed to reduce the approximation error. One is to 
replace \\y — y b s \\ by ||s — s b s || where s = S(y) is a vector of summary statistics. Low 
dimensional but informative summary statistics can greatly improve inferential accuracy, 



even at the price of potential information loss (Blum et al. 2013). Using summary statistics 



introduces another level of approximation, as then ABC approximates 7r(#,m|s bs) rather 



3 



than tt(8, m\y ohs ). A second approach is to post-process the sample from 7Tabc(#> m \Uobs) 
with e > so that is approximately transformed to a sample with e = 0. Termed regression- 
adjustment, for within-model parameter inference this takes the form of linear or non-linear 



regression-based transformations of 9i (Beaumont et al. , 2002 Blum and Frangois, 2010 



Blum et al. 2013). For the adjustment of model probabilities, post-processing can be per- 



formed by multinomial regression (Beaumont, 2008). 



1.1 Diagnostics for ABC 

This paper addresses two related open questions about ABC in practice. Firstly, is it possible 
to validate the ABC approximation of the posterior, 7Tabc(#> 'Hz/obs) ~ n (.@y m \yobs) (or 
ttabc(#, ^|s bs) ~ tt(0, m|s bs) when using summary statistics), as accurate? Secondly, how 
should e be chosen? Typically e is commonly chosen in an ad-hoc manner, although several 



authors (Bortot et al. 2007 Ratmann et al. 2009 Blum, 2010b Faisai et al., 2013) have 



suggested approaches where e is estimated as part of an extended model. 

In this paper we approach the question of the accuracy of the ABC posterior approxima- 
tion by examining whether the coverage property holds (described below). By numerically 
evaluating adherence to the coverage property through diagnostic statistics, we are able 
to determine the likely accuracy of the ABC approximation for a range of e values. In 
favourable circumstances this allows the user to choose as large a value of e as possible 
(for computational efficiency) for which coverage approximately holds. Alternatively, the 
coverage diagnostics may reveal that large approximation error remains for any choice of e. 

Our approach is based on credible intervals, a standard Bayesian method to give interval 
parameter estimates. Consider the case of within-model parameter inference for a fixed 
model tuq. An a% credible interval for a univariate parameter 9 is an interval / with the 
property that Pr(# e I\y bs) — at/100. Suppose that credible intervals are constructed from 
ABC output for data simulated from a known parameter value, 9q. Roughly speaking, the 
coverage property asserts that these intervals have the claimed probability of containing 6q. 
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To fully define the property we must be specific about the distribution of 9 . This is discussed 
later, where we give theoretical results supporting a particular choice for our purpose. We 
also describe how this definition can be extended to model inference problems. 



An equivalent condition to the coverage property was provided by Cook et al. (2006). 
Namely, the p-values of 9q < 9 (or 9q > 9) within the posterior estimates must have a 
£7(0, 1) distribution. Accordingly, the coverage property may be tested for a particular value 
of e, by repeatedly performing ABC for many choices of 9q and associated pseudo-observed 
data y ~ Tr(y\9o,mo), computing p-values, and then applying standard tests for uniformity. 
We provide a similar equivalent condition and a test of the coverage property for model 
probabilities, in addition to providing a computationally efficient process to compute the 
test statistics, for both within- and between-models. 

There is a sizeable literature related to the coverage property. Bayesian work includes 



determining the correctness of complex Bayesian simulation algorithms (Cook et al. , 2006), 



the post-processing of within-model ABC output (Menendez et al. 2012) and the validation 



of ABC analyses in the single model setting (Wegmann et al. 2009 2010 Aeschbacher et al. 



2012). A recent overview of work from a frequentist perspective is provided by Gneiting 



et al. (2007). However, this work has the somewhat different aim of determining consistency 



between statistical predictions and a sequence of observed outcomes (e.g. weather forecasts 
and meteorological data). Despite the difference in aims, the primary ideas behind our 



statistical tests of coverage go back to the frequentist literature: Dawid (1984) for continuous 



parameters and Seillier-Moiseiwitsch and Dawid (1993) for model choice. The approach we 
develop in this article is similar to the ABC papers mentioned above. Our contribution 
here, is to explain and justify the theoretical basis and methodology of coverage in more 
detail, to improve this methodology where needed, and to extend these ideas to the hitherto 
unconsidered realm of model inference for ABC. 

The remainder of the paper is structured as follows: Section [2] defines the coverage 
property for both parameter and model inference, and gives some theoretical results. The 
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methodological details of the resulting diagnostics are described in Section [3j including an 
algorithm and discussion of diagnostic statistics and tests. Section [| presents a simulated 
example to illustrate the methods and justify some implementation choices, followed by a re- 
analysis of a study into human demographic history to determine whether a reliable posterior 
approximation was obtained. Finally, Section [5] concludes with a discussion. 



2 Coverage 

We investigate whether the ABC approximation 7Tabc(#, "Hl/obs) (or 7Tabc(#, ^|sobs)) is a 
good representation of the posterior ir(9, m\y i, s ) (or n(9, m|s b s )) by testing the coverage 
property. For inference on a continuous scalar parameter, 9, an informal definition is that 
a given credible interval based on 9\y , where yo ~ n(y\9 Q , mo) for fixed m , should contain 
the true parameter, 9q, the appropriate proportion of times. This Section presents a more 
precise definition, a discussion of the property's consequences, and results on how it can be 
tested. We also describe a version of the property suitable for a model choice setting. We 
notationally work with y rather than s throughout this Section. 



2.1 Parameter inference 



We define the coverage property for the case of a continuous scalar parameter 9 for a fixed 



model mo, where for the remainder of Section 2A_ we drop all notational dependence on mo- 
For multivariate parameter vectors, our method will examine each parameter separately. 

The informal definition above is based on analysing data y$ simulated from known pa- 
rameter values 9q. To formalise the property we introduce a distribution for these, H(9q, yo), 



with densities associated with H denoted by h. A natural choice, used by Cook et al. (2006), 



Wegmann et al.| Q2009[ |2010[ ) and |Aeschbacher et al.| ( |2012[ ), is h(9 ,y ) = n(y \9 )7r(9 ); 
draw the parameters from the prior, and the data from the model of interest conditional on 
this. We present an argument in favour of an alternative choice for the ABC setting below. 
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Let g(9\y) be a density approximating the posterior given data y. From this, credible 
intervals of any level can be constructed. Suppose C(y, a) is a credible interval of level a%. 
We say that g satisfies coverage with respect to H if the coverage level of such an interval 
is a when analysing data generated from H (i.e. Pr(# G C(y ,a)) = a), for any choice of a 
and C. More formally, we have: 

Definition of coverage property: Let g(9\y) be a density approximating the univariate 
posterior n(9\y), and G y {6) be the corresponding distribution function. Consider a function 
B(ot) 0: [0, 1] defined for a G [0, 1] such that the resulting set has Lebesgue measure a. 
Let C(y,a) = G~ 1 [B{a)] and H(9 ,y ) be the distribution function for (6*o,yo)- We say g 
satisfies the coverage property with respect to distribution if(# ,2/o) if for every function B 
and every a G [0, 1], Pr(# G C(y , a)) = a. 

There are two requirements for the coverage property to be a useful criterion to determine 
how well g(0\y) approximates the posterior -n{6\y). These requirements will determine some 
characteristics of H(6 ,y ). Firstly, it should hold when g(9\y) = n(9\y). 

Result 1 The posterior, n(9\y), satisfies coverage with respect to any distribution H(9 ,y Q ) 
with conditional density h(9 \y ) = n(9 \y ). Proof in Appendix. 

The second requirement is that the coverage property should avoid false positives: it 
should not hold when g{9\y) ^ -n(9\y). However, coverage can hold when g(9\y) = n(9) 
equals the prior distribution, when 9 ~ it {9) is also drawn from the prior. 

Result 2 The prior, rc{9), satisfies coverage with respect to any distribution if (#0,2/0) with 
marginal density h(9 ) = ti(9 ). Proof in Appendix. 

The above results demonstrate that the choice h(9 ,y ) = n(9 Q ,y ) (where n(9 ,y ) = 
^(^o 12/0)^(2/0) = 7r (2/o|#o) 7r (#o)) leads to the coverage property holding for both the prior and 
posterior distributions. The false positive of the prior is particularly unwelcome in the ABC 
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context, as it corresponds exactly to the ABC approximation ttabc(0|z/) f° r e — )• oo. (The 
prior also coincides with the ABC posterior approximation when s = S(y) has no information 
for 9 under the model, so that 7Tabc(#|s) = tc(9\s) = 7r(#), although this is due to Result [T] 
rather than Result [2j See Section [5] for more discussion.) 

To avoid this we propose the alternative choice of h(9o,yo) oc n(9o, 2/oMl/o £ A]. That 
is, the distribution 7r(#o>?/o) truncated to require that the data lie within some subset A. 
This preserves h(9o\yo) = ^(^o\yo), so coverage holds for the posterior, but it typically alters 
^(^o) so that coverage not longer holds for the prior (i.e. h(9o) ^ n(9 )). We examine some 
convenient choices of A in Section [3] In this manner, we are aiming to evaluate coverage 
for datasets similar to y bs, rather than the much stronger context of coverage holding for 
all datasets. Note that the above results do not prove that the posterior 7c(9\y) is the only 
distribution to satisfy coverage with respect to our choice of H. However, we are unaware 
of any other such distributions that are likely to arise in the ABC context. 

An equivalent condition to the coverage property, which is easier to test, is the following: 

Result 3 Let H be the distribution function of (#o;2/o)- Define po = G yQ (9o), where G y {9) 
is the distribution function of 9 under g(9\y). Coverage holds with respect to H iff 

Po~U(0,l). (2) 

Proof in Appendix. 



A similar result was given by Cook et al. (2006), who proved that under coverage (with 



respect to the distribution function h(9o,yo) = Tr(yo\9 )n(9 )) the empirical distribution of 
p converges to U(0, 1). 

2.2 Model inference 

As for parameter inference, the definition of the coverage property for model inference re- 
quires us to specify the distribution of the known parameter values m and yo through 



if (mo, 2/o )) which can be considered a marginal distribution of H(m ,9 ,y ). Note that for 
model inference, H, its derivatives (h), and the mass function g(m\y) approximating the 
posterior p{m\y) are discrete functions. 

Formalising the intuitive notion of coverage for the case of model choice faces the difficulty 
of interpreting the idea of a credible interval for a discrete parameter. We firstly illustrate 
our definition with an example, and then formalise it below. Suppose that given data y 
simulated from model m G {1,2,3}, estimated posterior probabilities are 0.7,0.2 and 0.1. 
This could be viewed as defining three credible intervals; a 70% credible interval that m—1 
etc. We would like to investigate coverage in the following sense: given a 70% interval for 
m — 1 produced by some (mo,yo) pair, there is a 70% probability of it containing m . A 
technical difficulty is that the probability of a pair producing a 70% credible interval is 
typically zero. This difficulty can be avoided by requiring the following condition to hold for 
every a < b e.g. a = 0.69 and b = 0.71: Consider all y such that the estimated probability of 
m — 1 is between a and b. Conditioning on this, the probability of m = 1 also lies between 
a and b. 

Definition of coverage property: Let g(m\y) be a mass function approximating the 
posterior and G y {m) the corresponding distribution function. Given / = [a, b] C [0,1], 
define A(m,I) = {y\g{jn\y) G /}. We say g satisfies the coverage property with respect to 
distribution H(m , y ) if, for all i G {1,2,..., M} and /, either 

Pr(y G A(i,I))=0, or (3) 
Pr(m = i\yo G A(i,I)) G I. (4) 

Similar arguments to the parameter inference case show that the posterior satisfies cov- 
erage when h(m Q \y ) = 7r(mo|yo), but the prior satisfies coverage when h(m ) = p(m ). 

Result 4 The posterior, p(m\y), satisfies coverage with respect to any distribution H(m , y ) 
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with conditional mass function h(mo\yo) = p(m Q \y ). Proof in Appendix. 

Result 5 The prior, p(m), satisfies coverage with respect to any distribution H(m ,y Q ) with 
marginal mass function h(m ) = p(m ). Proof in Appendix. 

This means that the natural choice of h(m , y ) = p(m , y ) (where p(m , y ) = p(?/ol m o)Po( m o) — 
p(mo\yo)p(yo)) is not suitable. As before, our proposed solution is to truncate this distribu- 
tion on (m ,y ), so that h(m ,y ) = p(m ,y )I[y e A). 

As before, the above definition of coverage for model inference is not directly testable. 
Below we give an equivalent (under weak technical conditions) form which is. 

Result 6 Define z$ = g(i\yo)- Assume that for alii 6 {1,2, ...,M}, the measure 
on Zq induced by the distribution H(m ,y ) is not a singular continuous distribution with 
respect to Lebesgue measure. Coverage then holds with respect to H(mo,yo) if and only if, 
for all i 

Pr(m = i\zq = w) = w (5) 
holds for almost all w with respect to . Proof in Appendix. 



3 Method 

In this Section we discuss how to construct diagnostics based on the coverage property. In 
principle, this is simply a matter of repeatedly constructing an ABC posterior approximation, 
7^ABc( m , Q\yo), for known values of (mo, #o>2/o) ~ H(mo,9 ,y ), computing p-values and 
estimated model probabilities, and then testing whether the conditions ^ and ^ hold. 
This can be repeated for many e values until a suitable choice is found. However, simulating 
datasets for a single ABC analysis is typically computationally expensive. As such, we reuse 
the same simulations for each ABC analysis, along the lines of Algorithm [TJ as is common 



for ABC diagnostics (e.g. Blum et al. (2013)). Simulations in Section H indicate this makes 



little difference to the results. 

10 



In the following, we first present the full algorithm, including how to generate (m , ,y ) ~ 
H(mo,0o,yo)- We then describe several test statistics and diagnostic plots to allow an as- 
sessment of whether the conditions ^ and ^ hold. As in the previous Section, the details 
are presented in terms of y rather than s for notational simplicity. R code to implement these 



methods has been made available as part of the abc package (Csillery et al. , 2012). 



3.1 Algorithm 

ABC coverage diagnostics 

1. Determine integers N > 0, c > and candidate values of e: t\ > €2 > ■ ■ ■ e q > 0. 

2. Simulate a set U = {(mi,6i,yi)\i = l,...,N} of independent realisations of (m,0,y) from 
n(y\6, m)ir{0\m)p{m). 

3. Select VCD containing the c realisations that minimise \\yi — y bs 1 1 - 

4. For each (mo, 0q, yo) & V and for j = 1, . . . , q: 

(a) Let W = U\(m o ,0o,2/o)- 

(b) Find the subset of W such that \\yi — y ha\\ < ( j- 

(c) (Optional) Perform regression-adjustment post-processing. 

(d) Record p-values and estimated model probabilities. 

5. Construct plots of diagnostic statistics versus e. 



Algorithm 2: An algorithm to diagnose coverage for ABC as a function of kernel scale parameter 
e > 0. In the case of a single model, modify (m,0,y) — > (0,y) and n(y\9, m)Tr(9\m)p(m) —> 
ir(y\9)ir(6) in the obvious way. 

The algorithm for diagnosing coverage of the ABC approximation ^ABc{ m i Q\Vohs) ( or 
7TABc(0|2/obs)) is presented in Algorithm[2j The set V is a sample of size c from the distribution 
h(m ,9 ,y ) = n{y \9 ,m )'K(e \m )p('m )I[yo E A], where A = {y : ||y-y obs || < 5} for some 
S determined by c. Each element of V is taken as the known values (m , #o> Do) m turn, and 
the ABC posterior approximation estimated for a range of kernel scale parameters, e. 

Increasing c, the number of known values of (mo, 9i, yo), will improve the power of the 
tests of coverage. However the tradeoff is a greater computing cost, and that A becomes less 
concentrated around y bs, so the risk of the prior satisfying the coverage property increases. 
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The final choice is left to the user. However, we note that c can be increased (or decreased) 
based on preliminary findings. We investigate various values of c by simulation in Section |4j 
and based on this suggest c = 200 default. 



3.2 P- values and model probabilities 

For scalar 9, we require a p- value estimate of (|2j) under Result |3j based on a posterior sample 
0«, 9( 2 \ 0< n ). Here, we use p = (1 + ££ =1 1(9® < 9 ))/{2 + n), which is equivalent to 
the posterior mean for the binomial probability that 9j < 9$ under a uniform prior. This 
choice eliminates the occurrence of extreme values (p = or 1), which can overly influence 
some test statistics. For multivariate 9, we record a p-value estimate for each parameter. 

Given a posterior sample of model indicators mS 2 \ . . . , m^ n \ a straightforward esti- 
mate of the posterior probability of model i is the proportion which equal this: g(m = i\yo) = 
Y^j=i^-( m = i)/ n - Alternatively, regression-adjusted post-processing produces estimated 



posterior model probabilities directly (Beaumont, 2008). 



Removing an element of U in step 4a of Algorithm [2] can slightly bias the estimated ABC 



model probabilities. For example, let d = Y^i=i^[ m i = !]• For e = oo, g(m = l\yo) = 
(d — l)/(n — 1) for m = 1 and d/(n — 1) otherwise. This dependence of g(-\yo) on m 
causes unwanted behaviour in some diagnostics. To mitigate this, we reweight the model 
probability estimates to use the empirical prior model weights from U rather than W. That 
is, given estimated model probabilities g(rrii\yo) we adjust these to g(m = i\yo) oc g(m = 
i\yo)hi(U)/hi(W), where hi(-) gives the proportion of realisations from model i in the supplied 
set. No similar correction of parameter estimates was found to be necessary. 



3.3 Diagnostic statistics 

For each parameter and value of e we will have c replicated p-values, P\,P2, ■ ■ ■ ,Pc- We 
treat these as independent, although there may be mild dependence induced by Algorithm 
[2j Under the coverage property these will be distributed as U(0, 1) (Result [3]). There are a 
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number of tests for uniformity. Cook et al. (2006) used the diagnostic statistic 



a 2 = j>->^ 



(6) 



where $ is the standard normal distribution function. When the Pi values are independent 
£7(0, 1) draws, X 2 ~ Xc> which allows the calculation of a p-value for coverage (we report 
the p- value for a two-tailed test). We note that this statistic is unchanged if some pi values 
are replaced with 1 — pf, it does not test for symmetry of the distribution around 0.5. This 
can cause problems in practice. An example based on real data is the top left histogram of 
Figure S5 which displays pi values that are clearly not uniform but receive a p- value of 0.75 
from this diagnostic statistic. 



An alternative used by Wegmann et al. (2009, 2010) is the Kolmogorov-Smirnoff test 
statistic 

Y = sup|F c (x)-F(x)|, (7) 

X 

where F c (x) is the empirical distribution function of Pi,P2, ■ ■ ■ ,Pc and F(x) is the U(0, 1) 
distribution function. The distribution of Y if pi ~ U (0, 1) is known in exact and asymptotic 



forms (Durbin, 1973) and can be used to calculate a p- value for coverage (using a one- 
tailed test). Our pi values are not drawn from continuous distributions, but rather discrete 
distributions based on the number of posterior samples. However asymptotic p-values based 
on the continuous distribution can still be calculated and will be of the correct order of 
magnitude, which suffices for their purpose as a rough diagnostic guide. If more accurate 
p-values are required, Monte Carlo estimation is possible but more time consuming. 

For model inference diagnostics, we focus on the binary case of model m = i and model 
m i, for each i E {1,...,M}. For each e, Algorithm [2] is run on a sequence of yo 
values, 2/0,1) Vo,2, ■ ■ ■ , Vo,c, generated from m values, m 01 , m ,2, • • • , rriQ >c , to produce z values, 
Z\,Zq, . . . , z c , where Zj is the estimated probability of model i: PrABc(^ = i\yo,j)- Etefine 



qj = I(m , : 



Following Result [6j we wish to test the coverage hypothesis that qj 
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Bernoulli (zj), where all qj values are assumed independent, as before. 
A simple diagnostic statistic is the proportion of times model i occurs, 

c 

U = c- 1 Y,dr 

A central limit theorem holds for the distribution of U under the null hypothesis of coverage, 
conditional on the Zj values. However, this can be a poor approximation when some Zj values 
are close to or 1. Instead, we construct the null distribution by Monte Carlo methods to 
estimate the p- value for coverage (using a two-tailed test). To improve the stability of the 
resulting p-values, we use the same random seed across different e values. 

A drawback of U is that highly unlikely qj values, such as qj = 1 when Zj = 10~ 6 provide 
strong evidence against coverage, but do not contribute more to U . As an alternative, we 
can consider the log-likelihood, 

c 

V = l°g*i + (1 - Qi) Ml - zj)] , (8) 

with p-values for coverage (using a two-tailed test) again calculated by Monte Carlo simula- 
tion. A drawback of this statistic is that V is constant regardless of the qi values if Zj = 0.5, 
and so coverage cannot be rejected. Also note a similar statistic is to use the log-likelihood 
of c independent discrete random variables, W = YTj=\ l°gP r ABc( m = m oj|2/o,j)- This tests 
coverage of all models. 

A problem with these statistics is that they can be insensitive to departures from coverage 
which vary in nature with qj. It is difficult to define a statistic which is flexible enough to 



detect such problems for all possible (qj,Zj) sequences. Seillier-Moiseiwitsch and Dawid 



(1993) present a portmanteau statistic combining tests on a partition chosen for a specific 
dataset, but in our experience such a statistic is hard to adapt to work generally. As such, 
we advise that checking diagnostic plots is particularly important. Should these show poor 
performance of general purpose diagnostic statistics, they may motivate a better choice 
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specific to the problem of interest. 



3.4 Diagnostic plots 

For parameter inference, many standard diagnostic plots can be used to assess uniformity of 
Pi,P2, ■ ■ ■ iPci suc h as histograms and probability plots. 

For model inference, we present a diagnostic plot, an example of which is shown in 
Figure [7j Based on an equally-spaced partition of [0, 1] into subintervals, S, we estimate 
Pr(gj = l\zj G S) for each subinterval, by Bayesian inference for a binomial rate using a 
uniform prior. The diagnostic plot displays each posterior mean and 95% credible interval. 
Under coverage, each credible interval should include some of the associated Zj interval with 
high probability. The plot illustrates whether the coverage property appears to hold for 



each interval individually. This approach is similar to the Seillier-Moiseiwitsch and Dawid 



(1993) portmanteau statistic described above, but without the need to combine the results 
into a single statistic. Indeed, they also propose a "coverage plot," similarly plotting point 
estimates for several partitions. 



4 Analyses 

4.1 Simulated example 

We now examine how our coverage diagnostics perform in a simple simulated example. 
We consider that 100 data points are drawn independently from either a N(0, 1) or a 
gk(0, 1, g, 0) model, which are equally likely a priori. (For details on the g-and-k distri- 
bution see e.g. Drovandi and Pettitt 2011[ ) Inference can be split into binary model choice, 



and inference for the unknown parameter, g > 0, which has a [7(0,4) prior. Observed data, 
y bs, was drawn from the g-and-k model with g = 0.2. We base our ABC analysis on the me- 
dian and upper and lower quartiles as summary statistics, so we are interested in determining 
how well the ABC approximation 7TABc( m , #|s bs) represents the posterior 7r(m, 9\s). For the 
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analyses in this paper, we use weighted Euclidean distance \\a — b\\ = ~ bj^/v 2 ] 1 / 2 , 

where v 2 is the prior predictive variance of the j-th summary statistic, estimated from the 
set W in Algorithm [2] 

For analysis, we construct the set U from 2 x 10 6 simulated (m, 9, y) triples, half from 
each model. Figure [I] shows coverage diagnostic p-values from the statistics U, V, X 2 and 
Y as a function of e, and for parameter (left panels) and model (right panels) inference 
respectively. For parameter inference, we only present results from the g-and-k model. The 
top panels show diagnostics when the set of known values, V = {(m , #0, so)}, is a random 
sample of size c = 200 from U (i.e. the prior), and the middle and bottom panels when V 
consists of the c = 200 samples with s closest to s b s - 

The top panels in Figure [l] support Results|2]and[5j in that when (m , 9 , s ) ~ tt(s\9, m)7i(9\m)p(m) 
are drawn from the prior, then coverage holds both for the prior (i.e. large e values) and the 
posterior (i.e. small e values). When (m ,So,so) ~ tt (s 1 6>, m)7i(9\m)p(m)I(\\so — s bs|| < 5) 
are drawn from the truncated prior (Figure [TJ middle panels), then coverage does not hold 
for the prior. Note that the statistic U does not detect any deviation from coverage here, 
as discussed in Section ^3 Figure [T] also illustrates our earlier point, particularly in the 
case of parameter inference, that requiring coverage to hold for the prior is more demanding 
than requiring coverage to hold for the truncated prior. This is evidenced by the upturn in 
p-values only occurring at lower e values when V is drawn from the prior, compared to the 
relatively larger values of e when V is drawn from the truncated prior. 

For comparison, the procedure with V drawn from the truncated prior was repeated by 
resimulating W for each ABC analysis, thereby removing any effects of reusing (m, 9, s) 
samples, albeit at far greater computational expense. Figure [l] (bottom panels) shows that 
the results are nearly identical to those obtained using Algorithm [2j Further, we repeated 
our analysis with c = 100 and c = 500, and obtained qualitatively similar results (see Figures 
SI and S2 in the Supplementary Information), suggesting that the choice of c is not crucial 
to drawing the correct inferences. 
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Figures [2] and [3] show diagnostic plots to investigate coverage in more detail for e = 
0.28, 1.5, 13. These again demonstrate that coverage approximately holds for large e when 
V is drawn from the prior, but not from the truncated prior. They also provide insight 
into disagreements between the statistics within some panels in Figure [TJ For parameter 
inference, in the top left panel of Figure [TJ there is less than clear agreement about whether 
coverage roughly holds for the smallest e values. The top left panel of Figure [2] confirms 
that the p-value histogram has a non-uniform shape in this case. For model inference, the 
top right panel of Figure [T] suggests no deviation from coverage for any e for the U statistic. 
However, the top centre panel of Figure [3] illustrates that this is not correct. 

Our interpretation of these results is that e < 0.28 is sufficient to achieve approximate 
coverage for both parameter inference and model selection in the case where V is drawn from 
the truncated prior. Coverage does not hold (for small e) when V is drawn from the prior, 
which represents a stricter condition. However, the former case relates more to the dataset 
and analysis of interest. 



4.2 Application in human demographic history 



Sjodin et al. (2012) detail an ABC analysis of genetic data to choose between three demo- 
graphic models of human history: null, bottleneck and fragmentation models. Each model 
contains 9 unknown parameters: b, d, n, Nq, Na, Nb, Tb, T g and Td ur - Their analysis 
used 10 5 simulations from each model, and they accepted the 0.5% of simulations min- 



imising \\s — Sobs || , corresponding to e = 1.36. Sjodin et al. (2012) then used regression 
post-processing for parameter and model inference. We use the same experimental setup 
to evaluate the implemented value of e, and also to determine whether regression- adjusted 
post-processing improved the results. 

Figure [4] shows parameter inference diagnostics for d, N , Na and TJ,, without regression 
post-processing (see Supporting Information Figure S3 for the same plot for the remaining 
parameters). There is occasional significant disagreement between the statistics. However, 
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overall it is apparent that coverage is not attained for any e, apart from perhaps Td ur . P- 
value histograms for e = 1.36 (Figure S5) confirm that in most cases there is clear deviation 
from coverage. For model choice, both statistics agree that coverage is not attained (Figure 
6j left panels) and diagnostic plots for e = 1.36 confirm this (Figure [7j left panels). 

Regression post-processing was performed by conditional heteroskedastic, local-linear re- 



gression (Blum and Frangois, 2010) for parameter inference, and multinomial logistic regres- 



sion (Beaumont, 2008) for model inference. The regression post-processing greatly improves 



the results. The model inference statistics (Figure |6J right panels) now suggest that coverage 
holds for any choice of e investigated. The same is true of many parameters, although for 
others, coverage appears to hold only for smaller e (Figure [5j Figure S4). Diagnostic plots 
for parameter and model inference (Figure S6 and [7J right panels) produced for e = 1.36 
suggest that coverage is approximately achieved, except for some small concerns remaining 
for some parameters (e.g. Nb and Td ur ). 

On the whole, and with only a few minor caveats, our investigation largely validates the 



choice of e, and the use of regression post-processing by Sjodin et al. (2012). 



5 Discussion 



We have presented a method for validating whether an ABC analysis contains significant 
approximation error based on assessment of the coverage property. The method can be used 
to determine the kernel scale parameter, e, via simple dagnostic plots. We have used this 



method in a re-analysis of human demographic data (Sjodin et al. 2012), validating the 



choice of e and the use of regression-adjustment post-processing in that study. 



Our methodology draws on several previous approaches. In particular, Wegmann et al. 



(2009, 2010) use a similar scheme for validating ABC parameter inference, and suggest using 



the Y diagnostic statistic (Q. Also, Cook et al. (2006) employ a similar idea for Bayesian 
software testing using the X 2 statistic (|6|). Our contribution here is to provide: results on 
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the choice of (m ,9 ,y ) samples to use, a description of a general purpose methodology 
(and R code to implement it), and evidence that many diagnostic statistics are not fully 
trustworthy and should be supplemented with diagnostic plots. We also extend the coverage 
property definition and the validation methodology to cover model inference, incorporating 



ideas from Seillier-Moiseiwitsch and Dawid (1993) 



Our approach only aims to determine whether vrABc( m , ^|z/obs) is a good approximation 
of 7r(m, 9\y b B ), or whether 7Tabc(^, d\s bs) is a good approximation of 7r(m, 0|s o b s )- 111 order 
to use the coverage property to assess the approximation impact of summary statistics, in 
addition to that of e, it would be necessary to choose the set V to consist of those (m, 9, y) that 
minimise \\y — y bs\\, and then perform the rest of the analysis based on using (m,9, S(y)) 
rather than (m,9,y). Further investigation would be required to see if this is a practical 
approach. 

One word of caution: when a summary statistic is not informative for a parameter, so that 
^abc(^\ s ) — Tt(0\s) = tt(9), then our diagnostics will support a good posterior approximation 
for any value of e. This is, of course, the correct result, however it should not be misconstrued 
as any support of information content in s for 9. Additionally, the above diagnostics are 
evaluated for each parameter separately within a multivariate parameter 9. Hence, there is 
the possibility of diagnosing a good posterior approximation for all posterior margins, but 
not for the joint distribution of model parameters within any model. This could be resolved 
by constructing a suitable multivariate diagnostic and test. 

R code for implemeting Algorithm [2] can be found in the abc package, which is freely 
available on the CRAN. 

Preprint note: The R code is currently being incorporated into the above package. In the 
meantime, it is directly available /rom[http : //www . maths . lanes . ac . uk/ ^prangle/pub . html 
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Appendix: Proofs 



Here we provide proofs of Results [l[j6] presented in Section [2} Denote by F(m, 9, y) the joint 
distribution function defined by n(y\0, m)ir(9\m)p(m). We will also use F to denote the 
associated marginal and conditional distributions. 

Proof of Result [1} We have g(9\y) = n(9\y) and h(9 \yo) = n(9 \y ). In this case, 
G y {9) = F(9\y) = H(9\y). Hence 

Pr(9 G C(y ,a)\y ) = Pv(G yo (9 ) G B(a)\y ) = a 
Pr(9 e C(y ,a)) = E Giyo) [Pr(0 o G C(y ,a)\y )] = a. 

Proof of Result g We have g(9\y) = ir(9) and h(9 ) = ir(9 Q ). In this case, G y {9) = 
F(9) = H(9). Hence 

Pr(9 G C(y ,a)) = Pv(G yo (9 ) G 5(a)) = a. 

Proof of Result |3f First assume that coverage holds. Let B(a) = [0,a). Then 

a = Pr(6* G C(yo,a)) = Pr(p G [0,a)), (applying G yo to the event) 

so the distribution function of po equals that of a £7(0, 1) distribution. For the converse, now 
assume that p ~ U (0, 1). Then 

a = Pr(p G B(a)) = Pr(6> G C(?/o ; a)), (applying C?" 1 to the event) 

which is the condition needed for coverage. 

Proof of Result [4| We have g(m|?/) = p(m\y) and /i(mo||/o) = P^olz/o)- in this case, 
g(m\y) = p(m\y) = h(m\y). Fix some i and I such that ^ does not hold, and write A for 
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A(i,I). Then for y E A, h(i\y ) = g{i\y ) € I. Thus Pr(m = i\y E A) = E H{yo) [h(i\y )} E 
I, where H(y ) denotes the marginal distribution of H. Hence Q holds. 

Proof of Result [5j We have g(m\y) = p(m) and h(m ) = p(mo). In this case, g(m\y) = 
p(m) = h(m). Fix some i and / and write A for A(i,I). Consider first that p(i) ^ /. Then 
A = and so ^ holds. Suppose instead that p(i) E I. Then A is the set of all possible y 
values, and Pr(m = i\y E A) = Pr(m = i) = p{i) E I. Hence Q holds. 

Proof of Result [6j Assume that, for all i, ([5J holds for almost all w (with respect to 

as defined in the statement of this result). Fix some i and J, and consider the case where 

(|3j) does not hold. Then 

Pr(m = i\yo E A(i, I)) = Pr(m = i\z E I) 

= E z , [zo) \p(i\z )) 
= Hz'(zo)[zo) e /, 

where Z'(zq) is the marginal distribution of zq truncated to /. Thus, coverage holds. 

Next assume coverage with respect to H, and fix some i. For any w such that Pr(zo = 
w) > 0, it is immediate that ^ follows. Define I £ (w) = [w — e, w + e] fl [0, 1]. It suffices 
to prove that (|5| holds for w such that Pr(^o E / e (w)) > for all e > 0. (The set of 
other w values has probability zero, as each w lies within an interval of zero probability.) 
Fix such a w with Pr(zo = w) = 0. Note that zq E I e {w) represents the same event as 
yo E A £ := A(i, I £ (w)). From the assumption on w, ^ is false for / = I e (w) and e > 0. 
Hence ^ must hold i.e. 

Pr(m = i\yo E A e ) E I £ (w) for e > 0. 
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The left hand side of this equals 

Pr(m = i\z E I e (w)) = E z , (zo) [Pr(m = i\z )], 

where Z' e (z ) is the marginal distribution of z truncated to I £ {w). Thus 

E^(^ )[ Pr ( m o = i\zo)] e h{w). 

It follows by the Lebesgue differentiation theorem, using the assumption on zq assumed in 
the statement of Result |6j that Pr(m = i\z = w) = w for almost all w, as required. 
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Figure 2: Histograms of the c = 200 po values for the parameter g in the N(Q, 1) / g-and-fc example, for 
e = 0.28, 1.5, 13. In the top panels V is drawn from the prior; in the bottom panels V is drawn from the 
truncated prior. Columns indicate different e values. 




Figure 3: Model inference diagnostics in the N(0, 1) / g-and-k example, for e = 0.28, 1.5, 13. In the top 
panels V is drawn from the prior; in the bottom panels V is drawn from the truncated prior. Columns indicate 
different e values. Each panel shows the observed and predicted (under coverage) model probabilities for the 
N(0, 1) model, including a 95% credible interval for the predictions. 
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Figure 4: Plots of e against coverage p- values for the parameters d, Nq, Na and Tf, in the human de- 
mographic history analysis. Regression-adjusted post-processing is not implemented. Rows correspond to 
individual parameters; columns correspond to the three models. 
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Figure 5: Plots of e against coverage p- values for the parameters d, Nq, Na and Tf, in the human demo- 
graphic history analysis. Regression- adjusted post-processing has been implemented. Rows correspond to 
individual parameters; columns correspond to the three models. 
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Figure 6: Plots of e against coverage p- values in the human demographic history analysis. Rows correspond 
to the three models; columns correspond to the implementation of regression- adjustment post-processing. 
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Figure 10: (Supplementary Figure 3): As for Figure 4 (main text), but for the remaining 
model parameters T g , b, Nb, Td ur and n. 
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Figure 11: (Supplementary Figure 4): As for Figure 5 (main text), but for the remaining 
model parameters T g , b, Nb, Td ur and n. 
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Figure 12: (Supplementary Figure 5): Histograms of the c = 200 po values for the parameters d, N , Na 
and Tb in the human demographic history analysis, with e = 1.36. Regression-adjusted post-processing is 
not implemented. Rows correspond to individual parameters; columns correspond to the three models. 
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Figure 13: (Supplementary Figure 6): Histograms of the c = 200 po values for the parameters d, N , Na 
and Tfc in the human demographic history analysis, with e — 1.36. Regression-adjusted post-processing has 
been implemented. Rows correspond to individual parameters; columns correspond to the three models. 
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